Abstract

Feature weighting algorithms try to solve a problem of great importance
nowadays in machine learning: the search for a relevance measure
for the features of a given domain. This relevance is primarily used for
feature selection, as feature weighting can be seen as a generalization of it,
but it is also useful to better understand a problem’s domain or to guide
an inductor in its learning process. The Relief family of algorithms has proven
to be very effective in this task. Some other feature weighting methods
are reviewed in order to give some context, and then the different existing
extensions to the original algorithm are explained.

One of Relief’s known issues is the performance degradation of its
estimates when redundant features are present. A novel theoretical
definition of redundancy level is given in order to guide the work towards
an extension of the algorithm that is more robust against redundancy. A
new extension is presented that aims to improve the algorithm’s
performance. Some experiments were conducted to test this new extension against
the existing ones with a set of artificial and real datasets, and they showed that
in certain cases it improves the accuracy of the weight estimates.


1 Overview 

Feature selection is undoubtedly one of the most important problems in machine 
learning, pattern recognition and information retrieval, among others. A feature 
selection algorithm is a computational solution that is motivated by a certain 
definition of relevance. However, the relevance of a feature may have several
definitions depending on the objective pursued.

The generic purpose pursued is the improvement of the inductive learner,
either in terms of learning speed, generalization capacity or simplicity of the
representation. It is then possible to understand the obtained results better,
diminish the volume of storage, reduce noise generated by irrelevant or redundant
features and eliminate useless knowledge.

On the other hand, feature weighting algorithms try to estimate relevance
(in the form of weights assigned to the features) rather than making a binary
decision on whether a feature is relevant or not. This is a much harder problem,
but also a more flexible framework from an inductive learning perspective.
Algorithms of this kind are confronted with the down-weighting of irrelevant
features, the up-weighting of relevant ones and the problem of relevance
assignment when redundancy is an issue.

In this work we review Relief, one of the most popular feature weighting
algorithms. After a state-of-the-art review in Section 2 focused on feature
weighting methods in general, in Sections 2.3 and 2.4 we describe the algorithm
and its most important extensions. We are primarily interested in coping with
redundancy, and in studying to what extent the Relief algorithm can be modified
in order to improve its treatment of redundancy, which is one of its known
weaknesses. In this vein, Section 3 gives a novel and general (though
computationally infeasible) definition of redundancy level and tries to relate
it to the actual Relief performance. Next, we develop a "double" or feedback
extension of the algorithm that takes its own estimations into account in order
to improve general performance. We also complement this matter with a set of
experiments. The work concludes with some open questions and clear avenues of
continuation of the material herein presented.


2 State of the art 

2.1 Introduction 

In the last few years feature selection has become an increasingly common
topic of research. This increase in popularity is probably due to the growth in
the number of features of the problem domains. No more than ten years ago, few
problems treated domains with more than 50 features. Nowadays most papers deal
with domains with hundreds and even tens of thousands of features. New techniques
have to be developed to address this kind of problem, with many irrelevant and
redundant features and comparatively few instances to learn from. One example
of these new domains is web page categorization, a domain currently of much
interest for internet search engines, where thousands of terms can be found in
a document. Another example is appearance-based image classification, where
methods may use every pixel in the image. Classification problems with
thousands of features are very common in medicine and biology, e.g. molecule
classification, gene selection or medical diagnostics. In medical problems we
typically have fewer than a hundred patients, and for each patient we can have
thousands of features evaluated.

Feature selection can help us solve a classification problem with these
characteristics for many reasons. Firstly, it may make the task of data
visualization and understanding easier by eliminating irrelevant features which
can mislead the interpretation of the data. It can also reduce the cost of the
measurements, as we can avoid measuring irrelevant features; this is especially
important in domains where some features are very expensive to obtain, e.g.,
require a special medical test. In addition, a big benefit of feature selection
is defying the curse of dimensionality to help the induction of good classifiers
from the data. When many useless (i.e. irrelevant or redundant) features are
present in the training data, classifiers may find false regularities in the
input features and learn from those instead of learning from the features that
really determine the instance class (this also holds when predicting the
instance target value in the case of regression).

There are two main approaches to feature selection: filter methods and
wrapper methods. Both can be included in the framework shown in Fig. 1.
The main difference between them is the use of a classifier for the
estimation of a feature’s usefulness. The two families of methods only differ in
the way they evaluate the candidate sets of features. While the former
use a problem-independent criterion, the latter use the performance of the final
classifier to evaluate the quality of a feature subset. The basic idea of the filter
methods is to select the features according to some prior knowledge of the data,
for example, to select the features based on the conditional probability that
a given instance is a member of a certain class given the value of its features.
Another criterion commonly used by filter methods is the correlation of a feature
with the class, i.e. selecting features with high correlation. More criteria are
described in detail in Section 2.2. In contrast, wrapper methods suggest a set
of features that is given to a classifier, which uses them to classify some
training data and returns the performance of the classification, which is the
acceptance criterion of the feature set.

Figure 1: Feature selection framework

We have now explained two approaches to feature subset evaluation, but
it is clear that if we had to test all possible subsets of features, using
either of the methods, we would have a combinatorial explosion. If our initial
set of features is $\mathcal{F}$ and $|\mathcal{F}| = n$, the number of
evaluations we would have to do would be equal to the cardinality of the power
set of $\mathcal{F}$: $|\mathcal{P}(\mathcal{F})| = 2^n$. For this reason
diverse techniques have been developed to reduce the computational
complexity of this problem.

A different technique of determining feature usefulness apart from feature 
selection is a technique called feature weighting (or feature ranking). It consists 
of assigning a numeric value to each feature so as to indicate the feature’s 
usefulness. Feature weighting can help solve the problem of feature selection.
One possible approach to feature selection using feature weighting could be 
to first assign weights to features and then choose features according to their 
weights. This can be done either by having a rule to binarize the weights, e.g. 
select all the features with weight greater than zero, or by means of a weight 
guided feature subset evaluation, e.g. evaluating the subsets containing the 
features with greatest weight values. In fact, feature weighting could be seen 
as a generalization of feature selection, i.e. feature selection would be a specific 
kind of feature weighting where the weights assigned to features are binary. 
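
The two ways of turning weights into a selection mentioned above can be sketched as follows. This is only a minimal illustration: the function names, the example feature names and the example weights are hypothetical and not part of any method reviewed here.

```python
from typing import Dict, List


def select_by_threshold(weights: Dict[str, float], threshold: float = 0.0) -> List[str]:
    """Binarize the weights: keep features whose weight is strictly above the threshold."""
    return [name for name, w in weights.items() if w > threshold]


def select_top_k(weights: Dict[str, float], k: int) -> List[str]:
    """Weight-guided choice: keep the k features with the largest weights."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]


# Hypothetical weights produced by some feature weighting algorithm.
example_weights = {"age": 0.42, "height": -0.05, "id": 0.0, "dose": 0.17}
print(select_by_threshold(example_weights))  # ['age', 'dose']
print(select_top_k(example_weights, 2))      # ['age', 'dose']
```

A threshold of zero matches the rule quoted in the text; in practice the threshold or `k` would have to be tuned for the weighting method in use.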

In the following sections we will explore various existing feature
weighting algorithms and discuss their properties, in order to have a
starting point from which to describe and analyze the algorithm at the focus
of this paper: Relief.


2.2 Feature weighting 

This section reviews some of the most widely used feature weighting algorithms.
Although the section is focused on feature weighting, most of the methods
described below can also be used for feature selection.

In the following subsections $\mathcal{I}$ and $\mathcal{F}$ represent the sets
of instances and features respectively. $I$ or $I_i$ represent instances from
$\mathcal{I}$. $X$, $X_i$ or $Y$ are sets of possible values of a feature in
$\mathcal{F}$. $C$ represents the set of possible class values. Their lower case
versions represent a single value in the corresponding upper case set, e.g. we
will use $c \in C$ and $x \in X$. We will also use a short notation to express
probabilities, e.g. we will write $p(x)$ to represent the probability that
feature $X$ has value $x$, or $p(c|x)$ to express the conditional probability
that the class has value $c$ knowing that feature $X$ has value $x$.


Conditional Probabilities based methods 


The first group of methods we will look at are those based on the conditional
probabilities of the class given a feature value. Two simple methods using this
idea were introduced in [Creecy et al., 1992]: per-category feature importance
and cross-category feature importance (PCF and CCF for short). One important
limitation is that they can only deal with binary features, so numerical features
must be discretized and symbolic features converted to a group of binary
features. The weight assigned to a feature in the case of PCF depends on the
class, as seen in Eq. 2.1

$$w_{pcf}(X, c) = P(c|x), \text{ where } x \text{ is the positive feature value} \tag{2.1}$$

so we have a weight for each feature and class. CCF relies on the same idea, but
instead of having one weight for each feature and class it has only one weight
per feature. It does so by aggregating the weights across classes: as Eq. 2.2
shows, it uses the sum of squares of the conditional probabilities.

$$w_{ccf}(X) = \sum_{c \in C} P(c|x)^2, \text{ where } x \text{ is the positive feature value} \tag{2.2}$$

Later on [Mohri and Tanaka, 1994] showed that PCF is too sensitive to class
proportions and tends to answer the most frequent class when used for
classification.
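
As an illustration of Eqs. 2.1 and 2.2, the following minimal sketch estimates PCF and CCF from counts for a single binary feature. The function names and the toy data are our own assumptions; probabilities are estimated by relative frequencies.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def pcf_weights(feature: Sequence[int], labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Per-category feature importance: estimate P(c | x = 1) for every class c."""
    positive_labels = [c for x, c in zip(feature, labels) if x == 1]
    counts = Counter(positive_labels)
    n_positive = len(positive_labels)
    return {c: (counts[c] / n_positive if n_positive else 0.0) for c in set(labels)}


def ccf_weight(feature: Sequence[int], labels: Sequence[Hashable]) -> float:
    """Cross-category feature importance: sum over classes of P(c | x = 1) squared."""
    return sum(p * p for p in pcf_weights(feature, labels).values())


# Toy data: the feature is positive for three instances of class 'a' and one of class 'b'.
feature = [1, 1, 1, 0, 0, 1]
labels = ["a", "a", "b", "b", "b", "a"]
print(pcf_weights(feature, labels))  # P(a|1) = 0.75, P(b|1) = 0.25
print(ccf_weight(feature, labels))   # 0.75**2 + 0.25**2 = 0.625
```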

A more sophisticated approach that also makes use of conditional probabilities
is the one used by the value difference method (VDM) introduced by
[Stanfill and Waltz, 1986]. This time no binarization of features is required,
although numeric features still have to be discretized in order to calculate
conditional probabilities, as shown in Eq. 2.3. In addition, this method does not
assign a weight to each feature but to each value of each feature.

$$w_{vdm}(X, x) = \sum_{c \in C} \frac{P(x|c)}{P(x)} \tag{2.3}$$

This weighting scheme was originally used to calculate distances between
features.

Finally we have Gini-index gain [Breiman et al., 1984] in Eq. 2.4, which can
be interpreted as the expected error rate

$$GG(X) = \sum_{x \in X} P(x) \sum_{c \in C} P(c|x)^2 - \sum_{c \in C} P(c)^2 \tag{2.4}$$

and has been shown to be biased towards multiple-valued features. In further
sections we will see that this particular measure has some relation with the
Relief algorithm.
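
Eq. 2.4 can be computed directly from counts. The sketch below is a minimal illustration under the assumption that probabilities are estimated by relative frequencies; the function name `gini_gain` is ours.

```python
from collections import Counter
from typing import Hashable, Sequence


def gini_gain(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Gini-index gain: sum_x P(x) sum_c P(c|x)^2 - sum_c P(c)^2."""
    m = len(labels)
    class_counts = Counter(labels)
    value_counts = Counter(feature)
    joint_counts = Counter(zip(feature, labels))
    prior_term = sum((n / m) ** 2 for n in class_counts.values())
    posterior_term = 0.0
    for x, n_x in value_counts.items():
        inner = sum((joint_counts[(x, c)] / n_x) ** 2 for c in class_counts)
        posterior_term += (n_x / m) * inner
    return posterior_term - prior_term


labels = ["a", "a", "b", "b"]
print(gini_gain([0, 0, 1, 1], labels))  # feature determines the class: 0.5
print(gini_gain([0, 1, 0, 1], labels))  # feature independent of the class: 0.0
```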


Information theory based methods 


Not all feature weighting methods are based on conditional probabilities,
though. We will now describe some methods based on information theory
[Shannon, 1948; Shannon and Weaver, 1949].


The first one simply uses Shannon’s mutual information (MI) between two
features $X$ and $Y$, given in Eq. 2.5, to weight features.

$$MI(X,Y) = H(X) - H(X|Y) = \sum_{x \in X, y \in Y} P(x,y) \log_2 \frac{P(x,y)}{P(x)P(y)} \tag{2.5}$$

It is defined using entropies and conditional entropies (see Eqs. 2.6 to 2.8).

Entropy:

$$H(X) = -\sum_{x \in X} P(x) \log_2 P(x) \tag{2.6}$$

Conditional entropy:

$$H(X|Y) = H(X,Y) - H(Y) \tag{2.7}$$

Joint entropy:

$$H(X,Y) = -\sum_{x \in X, y \in Y} P(x,y) \log_2 P(x,y) \tag{2.8}$$

A more informal but maybe more intuitive definition of mutual information is
that MI measures the information of $X$ that is also in $Y$. If the features
are independent, no information is shared, so the mutual information is zero.
At the other extreme, one feature is an exact copy of the other: all the
information it contains is also shared by the other, so the mutual information
is the same as the information conveyed by one of them, namely its entropy.
A very popular feature weighting method uses the idea of mutual information.
It was proposed by [Hunt et al., 1966] and is used in [Quinlan, 1986] when
splitting nodes in top-down induction of decision trees (TDIDT), best known
as ID3. The term information gain (IG) is used there. Its intuitive
interpretation would be: the more a feature reduces class entropy when its
value is known, the larger its weight. This is just another way to say: the more
information is shared between a feature and the class, the larger its weight. So
if we have a set of classes $C$ we can define IG for the class knowing the value
of a feature $X$ as shown in Eq. 2.9

$$IG(C|X) = MI(C,X). \tag{2.9}$$
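
Eqs. 2.5 to 2.9 translate almost literally into code. The following minimal sketch estimates the probabilities by relative frequencies; the function names are our own and the toy data only illustrates the two extreme cases discussed above (independent features and exact copies).

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """H(X) = -sum_x P(x) log2 P(x), Eq. 2.6."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())


def joint_entropy(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """H(X,Y), Eq. 2.8: the entropy of the paired values."""
    return entropy(list(zip(xs, ys)))


def conditional_entropy(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """H(X|Y) = H(X,Y) - H(Y), Eq. 2.7."""
    return joint_entropy(xs, ys) - entropy(ys)


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """MI(X,Y) = H(X) - H(X|Y), Eq. 2.5."""
    return entropy(xs) - conditional_entropy(xs, ys)


def information_gain(labels: Sequence[Hashable], feature: Sequence[Hashable]) -> float:
    """IG(C|X) = MI(C,X), Eq. 2.9."""
    return mutual_information(labels, feature)


x = [0, 0, 1, 1]
print(mutual_information(x, [0, 1, 0, 1]))  # independent features: 0.0
print(mutual_information(x, x))             # exact copy: equals entropy(x) = 1.0
```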


Later on, similar methods were introduced to reduce the bias of IG towards
features with a large number of values. The extreme case is a feature that
holds an ID code. It is clear that, knowing the ID code, we can precisely
know the class of any instance in our training set. The problem is that we can
say nothing about a new instance, which will have another, unknown ID code.
One of these methods is gain ratio (GR) in Eq. 2.10, used by the C4.5 decision
tree induction algorithm [Quinlan, 1993], which normalizes IG by the amount of
information needed to predict a feature’s value (the entropy of the feature).
But there are also various other proposals, among them the entropy distance
[MacKay, 2003] in Eq. 2.11 and the Mántaras distance between the class and
the feature in Eq. 2.12, which was proved to be unbiased towards multiple-valued
features.



$$GR(C|X) = \frac{IG(C|X)}{H(X)} \tag{2.10}$$

$$D_H(C, X) = H(C, X) - MI(C, X) \tag{2.11}$$

$$D_M(C, X) = \frac{D_H(C, X)}{H(C, X)} \tag{2.12}$$
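
The ID-code example can be checked numerically. The sketch below is a minimal illustration, assuming gain ratio is information gain divided by the entropy of the feature as described in the text; the helper names are ours. An ID feature obtains the maximum information gain, but its gain ratio is halved with respect to a binary feature that predicts the class equally well.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(values: Sequence[Hashable]) -> float:
    """H(X) = -sum_x P(x) log2 P(x), with probabilities estimated from counts."""
    m = len(values)
    return -sum((n / m) * math.log2(n / m) for n in Counter(values).values())


def information_gain(labels: Sequence[Hashable], feature: Sequence[Hashable]) -> float:
    """IG(C|X) = H(C) - H(C|X) = H(C) + H(X) - H(C,X)."""
    return entropy(labels) + entropy(feature) - entropy(list(zip(labels, feature)))


def gain_ratio(labels: Sequence[Hashable], feature: Sequence[Hashable]) -> float:
    """IG normalized by the entropy of the feature (0.0 for a constant feature)."""
    h_feature = entropy(feature)
    return information_gain(labels, feature) / h_feature if h_feature > 0 else 0.0


labels = ["a", "a", "b", "b"]
id_code = [0, 1, 2, 3]   # unique value per instance
useful = [0, 0, 1, 1]    # binary feature that determines the class
print(information_gain(labels, id_code), information_gain(labels, useful))  # 1.0 1.0
print(gain_ratio(labels, id_code), gain_ratio(labels, useful))              # 0.5 1.0
```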



















Distribution distance based methods 

Another way to find dependencies between a feature and the class is to measure
differences between their distributions. Perhaps the simplest way to do so is
to compute the difference between the joint and the product distributions as
shown in Eq. 2.13

$$Diff(C, X) = \sum_{c \in C, x \in X} |P(c,x) - P(x)P(c)| \tag{2.13}$$

and this distance can be directly used as the feature’s weight. Large differences
between the joint and the product distributions indicate a large dependency of
the class on the feature, so the feature should be given a large weight. This can
easily be applied to continuous features by replacing the sum with an integral.
It can also easily be rescaled to the [0,1] interval, as it is bounded above.
More distance functions can be used here. An interesting one is the
Kullback-Leibler divergence, which is in fact not a distance as it is not
symmetric (i.e., $D_{KL}(X\|Y) \neq D_{KL}(Y\|X)$). Its application to feature
weighting is to have the weight be equal to the divergence between the joint
and the product distributions, see Eq. 2.14

$$D_{KL}(P(X,C)\|P(X)P(C)) = \sum_{c \in C, x \in X} P(c,x) \log \frac{P(c,x)}{P(x)P(c)} \tag{2.14}$$

Note that this is exactly the same as the mutual information between the feature
and the class (see Eq. 2.5), so we have $D_{KL}(P(X,C)\|P(X)P(C)) = MI(X, C)$.
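
Both distribution-based weights can be sketched in a few lines. The code below is a minimal illustration with our own function names; probabilities are estimated by relative frequencies and the logarithm is taken in base 2 (an assumption, chosen so that the result is directly comparable with the mutual information of Eq. 2.5).

```python
import math
from collections import Counter
from typing import Dict, Hashable, Sequence, Tuple


def _distributions(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> Tuple[Dict, Dict, Dict]:
    """Relative-frequency estimates of P(x), P(c) and P(c, x)."""
    m = len(labels)
    p_x = {x: n / m for x, n in Counter(feature).items()}
    p_c = {c: n / m for c, n in Counter(labels).items()}
    p_cx = {cx: n / m for cx, n in Counter(zip(labels, feature)).items()}
    return p_x, p_c, p_cx


def diff_weight(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Eq. 2.13: sum over (c, x) of |P(c,x) - P(x)P(c)|."""
    p_x, p_c, p_cx = _distributions(feature, labels)
    return sum(abs(p_cx.get((c, x), 0.0) - p_x[x] * p_c[c]) for c in p_c for x in p_x)


def kl_weight(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Eq. 2.14 in bits; pairs with P(c,x) = 0 contribute nothing to the sum."""
    p_x, p_c, p_cx = _distributions(feature, labels)
    return sum(p * math.log2(p / (p_x[x] * p_c[c])) for (c, x), p in p_cx.items())


labels = ["a", "a", "b", "b"]
print(diff_weight([0, 0, 1, 1], labels), kl_weight([0, 0, 1, 1], labels))  # 1.0 1.0
print(diff_weight([0, 1, 0, 1], labels), kl_weight([0, 1, 0, 1], labels))  # 0.0 0.0
```

For the first feature the divergence equals 1 bit, which is also the mutual information between that feature and the class, as the note above states.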


Correlation based methods 

Even though this approach to feature weighting is treated last, it is maybe one
of the simplest, as it is not concerned with continuous feature discretization
or probability density estimation. It is usual in statistics to construct
contingency tables for pairs of discrete variables to analyze their correlation.
In our case (see Table 1) we will define a contingency table between the classes
$c_i \in C$ and the values $x_j \in X$ of a feature. The inner cell in row $i$
and column $j$ of the table contains the number of instances of class $c_i$ that
have feature $X = x_j$. The row marginal totals give the number of instances of
the corresponding class and the column marginal totals the number of instances
with the corresponding value of feature $X$. Finally, the sum of either set of
marginal totals should be the total number of instances $m$. Looking at this
table we can define the chi-squared weight for feature $X$ as shown in Eq. 2.15

$$\chi^2(X) = \sum_{c \in C, x \in X} \frac{(N_{cx} - E_{cx})^2}{E_{cx}} \tag{2.15}$$

where $E_{cx}$ is the expected number of instances of class $c$ with value $x$
on feature $X$, calculated as $N_{c\cdot} N_{\cdot x}/m$. $\chi^2(X)$ is
distributed approximately as a $\chi^2$ with $(v - 1)(w - 1)$ degrees of
freedom. We should avoid terms with $E_{cx} = 0$ or replace them with a small
positive number. We can see that in the extreme case that $X$ and $C$ are
completely independent, $N_{cx} = E_{cx}$ is expected, so large values of
$\chi^2(X)$ indicate strong dependence between the feature and the class. Note
that the result of $\chi^2$ depends not only on the joint probabilities
$P(c, x) = N_{cx}/m$ but also on the number of instances $m$. This dependency on
the number of instances agrees with the intuition that correlations calculated
with a small number of instances should be less accurate.

          x_1     x_2     ...     x_v     Tot.
  c_1     N_11    N_12    ...     N_1v    N_1.
  ...     ...     ...     ...     ...     ...
  c_w     N_w1    N_w2    ...     N_wv    N_w.
  Tot.    N_.1    N_.2    ...     N_.v    m

  v        No. of values for X
  w        No. of classes (C)
  m        Total no. of instances
  N_c.     Total no. in class c
  N_.xj    Total no. with X = x_j
  N_cxj    No. with C = c and X = x_j

Table 1: Contingency table of the class vs. the X feature values
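
The chi-squared weight can be sketched as follows, assuming the usual Pearson statistic built from the contingency table counts; the function name is ours. The second call duplicates every instance, which leaves the joint probabilities untouched but doubles the weight, illustrating the dependency on $m$ noted above.

```python
from collections import Counter
from typing import Hashable, Sequence


def chi_squared_weight(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Pearson chi-squared statistic between a discrete feature and the class."""
    m = len(labels)
    n_c = Counter(labels)                 # row marginal totals
    n_x = Counter(feature)                # column marginal totals
    n_cx = Counter(zip(labels, feature))  # inner cells of the table
    total = 0.0
    for c in n_c:
        for x in n_x:
            # Marginals come from observed values, so the expectation is never zero.
            expected = n_c[c] * n_x[x] / m
            total += (n_cx[(c, x)] - expected) ** 2 / expected
    return total


feature = [0, 0, 1, 1]
labels = ["a", "a", "b", "b"]
print(chi_squared_weight(feature, labels))          # 4.0
print(chi_squared_weight(feature * 2, labels * 2))  # 8.0, same proportions, twice the data
```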

2.3 Relief 

One common characteristic of the previously cited methods is that they treat
features individually, assuming conditional independence of the features given
the class. On the other hand, Relief takes all other features into account when
evaluating a specific feature. Another interesting characteristic of Relief is
that it is aware of contextual information, being able to detect local
correlations of feature values and their ability to discriminate from an
instance of a different class.

The main idea behind Relief is to assign large weights to features that
contribute to separating near instances of different classes and joining near
instances belonging to the same class. The word "near" in the previous sentence
is of crucial importance, since we mentioned that one of the main differences
between Relief and the other cited methods is the ability to take local context
into account. Relief does not reward features that separate (join) instances of
different (same) classes in general, but features that do so for near instances.

In Fig. 2 we can see the original algorithm presented by Kira and Rendell
in [Kira and Rendell, 1992]. We maintained the original notation, which slightly
differs from the one used above, as features (attributes) are now labeled $A$.
There we can see that, with the aim of detecting whether the feature is useful
to discriminate near instances, it selects two nearest neighbors of the current
instance $R_i$: one from the same class, $H$, called the nearest hit, and one
from the different class, $M$, called the nearest miss (the original Relief
algorithm only dealt with two-class problems). With these two nearest neighbors,
it decreases the weight of a feature by the amount its value differs between
$R_i$ and $H$, and leaves the weight unchanged when both values agree. The
opposite occurs with the nearest miss: Relief increases the weight of a feature
by the amount its value differs between $R_i$ and $M$.

Input: for each training instance a vector of feature values and the class value
Output: the vector W of estimations of the qualities of features

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance R_i;
4.     find nearest hit H and nearest miss M;
5.     for A := 1 to a do
6.         W[A] := W[A] - diff(A, R_i, H)/m + diff(A, R_i, M)/m
7. end;

Figure 2: Pseudo code of the original Relief algorithm

One of the central parts of Relief is the difference function diff, which is also
used to compute the distance between instances as shown in Eq. 2.16

$$\delta(I_1, I_2) = \sum_{i} \mathrm{diff}(A_i, I_1, I_2) \tag{2.16}$$

The original definition of diff was a heterogeneous distance metric composed of
the overlap metric in Eq. 2.17 for nominal features and the normalized Euclidean
distance in Eq. 2.18 for linear features, which [Wilson and Martinez, 1997]
called HEOM.

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 0 & \text{if } value(A, I_1) = value(A, I_2) \\ 1 & \text{otherwise} \end{cases} \tag{2.17}$$

$$\mathrm{diff}(A, I_1, I_2) = \frac{|value(A, I_1) - value(A, I_2)|}{\max(A) - \min(A)} \tag{2.18}$$

The normalization of the differences by $m$ guarantees that the weights lie in
the range [-1,1]. In fact the algorithm tries to approximate the probability
difference in Eq. 2.20

$$W[A] \approx P(\text{different value of } A \mid \text{nearest instance from different class}) \; - \tag{2.19}$$
$$P(\text{different value of } A \mid \text{nearest instance from same class}) \tag{2.20}$$


We can see that for a set of instances $\mathcal{I}$ having a set of features
$\mathcal{F}$ this algorithm has cost
$O(m \times |\mathcal{I}| \times |\mathcal{F}|)$, as it has to loop over $m$
instances. For each instance in the main loop it has to compute its distance
from all other instances, so we have $O(m \times |\mathcal{I}|)$ times the
complexity of calculating the distance between two instances, and we can easily
see from Eq. 2.16 that this complexity is $O(|\mathcal{F}|)$, which gives the
total of $O(m \times |\mathcal{I}| \times |\mathcal{F}|)$. As $m$ is a
user-defined parameter, we can to some extent control the cost of the Relief
algorithm, trading off accuracy of estimation (for large $m$) against low
complexity of the algorithm (for small $m$). However, $m$ can never be greater
than $|\mathcal{I}|$.
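
The pseudo code of Fig. 2, together with the diff function and the distance of Eq. 2.16, can be sketched in runnable form as follows. This is only an illustrative sketch under our own assumptions: the names `make_diff` and `relief`, the `nominal` flags and the `seed` parameter are hypothetical, sampling is done with replacement, and every class is assumed to contain at least two instances.

```python
import random
from typing import Callable, List, Sequence


def make_diff(data: Sequence[Sequence], nominal: Sequence[bool]) -> Callable:
    """Build diff(a, i1, i2): overlap for nominal features, range-normalized otherwise."""
    ranges = []
    for a in range(len(data[0])):
        if nominal[a]:
            ranges.append(None)
        else:
            column = [row[a] for row in data]
            ranges.append((max(column) - min(column)) or 1.0)  # guard constant features

    def diff(a: int, i1: Sequence, i2: Sequence) -> float:
        if nominal[a]:
            return 0.0 if i1[a] == i2[a] else 1.0
        return abs(i1[a] - i2[a]) / ranges[a]

    return diff


def relief(data: Sequence[Sequence], labels: Sequence, nominal: Sequence[bool],
           m: int, seed: int = 0) -> List[float]:
    """Two-class Relief following the update rule of line 6 in Fig. 2."""
    rng = random.Random(seed)
    n_features = len(data[0])
    diff = make_diff(data, nominal)

    def distance(i1: Sequence, i2: Sequence) -> float:
        return sum(diff(a, i1, i2) for a in range(n_features))

    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(len(data))
        r = data[i]
        hits = [j for j in range(len(data)) if j != i and labels[j] == labels[i]]
        misses = [j for j in range(len(data)) if labels[j] != labels[i]]
        hit = data[min(hits, key=lambda j: distance(r, data[j]))]
        miss = data[min(misses, key=lambda j: distance(r, data[j]))]
        for a in range(n_features):
            weights[a] += (diff(a, r, miss) - diff(a, r, hit)) / m
    return weights


# Toy problem: the class equals the first feature, the second feature is noise.
data = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 0, 1, 1]
print(relief(data, labels, nominal=[True, True], m=4))  # [1.0, -1.0]
```

In this tiny dataset the nearest hit always differs only in the noisy feature and the nearest miss only in the relevant one, so the result does not depend on which instances are sampled.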

2.4 Extensions of Relief 

The first modification proposed to the algorithm is to make it deterministic
by replacing the outer loop over $m$ randomly chosen instances with a loop
over all instances. This obviously increases the algorithm’s computational cost,
which becomes $O(|\mathcal{I}|^2 \times |\mathcal{F}|)$, but makes experiments
with small datasets more reproducible. Kononenko uses this simplified version of
the algorithm in his paper [Kononenko, 1994] to test his new extensions to the
original Relief. This version is also used by other authors
[Kohavi and John, 1997], where it is given the name Relieved, with the final d
for "deterministic".

We can find some extensions to the original Relief algorithm proposed in
[Kononenko, 1994] in order to overcome some of its limitations: it could not
deal with incomplete datasets, it was very sensitive to noisy data, and it could
only deal with multi-class problems by splitting the problem into a series of
2-class problems.

To enable Relief to deal with incomplete datasets, i.e. datasets that contain
missing values, a modification of the diff function is needed. The new function
must be capable of calculating the difference between a value of a feature and a
missing value, and between two missing values, in addition to the calculation of
the difference between two known values. Kononenko proposed various
modifications of this function in his paper and found that the one that
performed better than the others was the one used in a version of Relief he
called RELIEF-D (not to be confused with the Relieved mentioned above). The
difference function used by RELIEF-D for missing values can be seen in Eq. 2.21

$$\mathrm{diff}(A, I_1, I_2) = \begin{cases} 1 - P(value(A, I_2) \mid class(I_1)) & \text{if } I_1 \text{ is missing} \\ 1 - \sum_{a \in A} \left[ P(a \mid class(I_1)) \times P(a \mid class(I_2)) \right] & \text{if both missing} \end{cases} \tag{2.21}$$

Now we will focus on giving Relief greater robustness against noise. This 
robustness can be achieved by increasing the number of nearest hits and misses 
to look at. This mitigates the effect of choosing a neighbor that would not 
have been the nearest without the effect of noise. The new algorithm has a 
new user defined parameter k that controls the number of nearest neighbors to 
use. In choosing k there is a tradeoff between locality and noise robustness. 
[Kononenko, 1994] states that 10 is a good choice for most purposes.

The last limitation was that the algorithm was only designed for 2-class
problems. The straightforward extension to multi-class problems would be to
take as the near miss the nearest neighbor belonging to a different class. This
variant of Relief is called Relief-E by Kononenko. But later on he proposes
another variant which gave better results: taking the nearest neighbor
(or the k nearest) from each class and averaging their contributions, so as to
keep the contributions of hits and misses symmetric and within the interval
[0,1]. That gives the Relief-F (ReliefF from now on) algorithm seen in Fig. 3.


Input: for each training instance a vector of feature values and the class value
Output: the vector W of estimations of the qualities of features

1. set all weights W[A] := 0.0;
2. for i := 1 to m do begin
3.     randomly select an instance R_i;
4.     find k nearest hits H_j;
5.     for each class C ≠ class(R_i) do
6.         find k nearest misses M_j(C);
7.     for A := 1 to a do
8.         W[A] := W[A] - sum_{j=1..k} diff(A, R_i, H_j)/(m·k) +
9.             sum_{C ≠ class(R_i)} [ sum_{j=1..k} diff(A, R_i, M_j(C)) ] /(m·k);
10. end;

Figure 3: Pseudo code of the ReliefF algorithm
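
A runnable sketch of the ReliefF update is given below. It is only an illustration under several assumptions of ours: all features are nominal (overlap diff), every class has at least k + 1 instances, and the names `relieff`, `use_priors` and `seed` are hypothetical. With `use_priors=True` each class of misses is scaled by $P(C)/(1 - P(class(R_i)))$, the class-prior averaging that, to the best of our knowledge, published descriptions of ReliefF use; this factor is not visible in Fig. 3, and `use_priors=False` follows the figure literally.

```python
import random
from collections import Counter
from typing import List, Sequence


def overlap(a: int, i1: Sequence, i2: Sequence) -> float:
    """diff for nominal features: 0 if the values agree, 1 otherwise."""
    return 0.0 if i1[a] == i2[a] else 1.0


def relieff(data: Sequence[Sequence], labels: Sequence, m: int, k: int,
            use_priors: bool = True, seed: int = 0) -> List[float]:
    rng = random.Random(seed)
    n, n_features = len(data), len(data[0])
    prior = {c: count / n for c, count in Counter(labels).items()}

    def distance(i1: Sequence, i2: Sequence) -> float:
        return sum(overlap(a, i1, i2) for a in range(n_features))

    def nearest(i: int, cls) -> List[Sequence]:
        """The k instances of class cls closest to instance i (excluding i itself)."""
        candidates = [j for j in range(n) if j != i and labels[j] == cls]
        candidates.sort(key=lambda j: distance(data[i], data[j]))
        return [data[j] for j in candidates[:k]]

    weights = [0.0] * n_features
    for _ in range(m):
        i = rng.randrange(n)
        r, cls = data[i], labels[i]
        hits = nearest(i, cls)
        for a in range(n_features):
            weights[a] -= sum(overlap(a, r, h) for h in hits) / (m * k)
        for c in prior:
            if c == cls:
                continue
            # Assumed class-prior averaging; scale 1.0 reproduces Fig. 3 as printed.
            scale = prior[c] / (1.0 - prior[cls]) if use_priors else 1.0
            misses = nearest(i, c)
            for a in range(n_features):
                weights[a] += scale * sum(overlap(a, r, x) for x in misses) / (m * k)
    return weights


# Three classes given by the first feature; the second feature is noise.
data = [(v, b) for v in range(3) for b in range(2)]
labels = [v for v, _ in data]
print(relieff(data, labels, m=4, k=1))                    # close to [1.0, -1.0]
print(relieff(data, labels, m=4, k=1, use_priors=False))  # [2.0, -1.0]
```

Without the averaging the miss contribution grows with the number of classes, which is why the weight of the relevant feature leaves the [-1,1] range in the second call.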


The above mentioned relation to impurity functions, specifically with
Gini-index gain in Eq. 2.4, can be seen in [Robnik-Šikonja and Kononenko, 2003]
when developing the probability difference in Eq. 2.20 in the case that the
algorithm uses a large number of nearest neighbors (i.e., when the selected
instance could be any one from the set of instances). This version of the
algorithm is called myopic ReliefF, as it loses its locality property.
Rewriting Eq. 2.20 by removing the neighboring condition and applying Bayes’
rule, we obtain Eq. 2.22

$$W'[A] = \frac{P_{samecl|eqval} P_{eqval}}{P_{samecl}} - \frac{(1 - P_{samecl|eqval}) P_{eqval}}{1 - P_{samecl}} \tag{2.22}$$

For sampling with replacement we have:

$$P_{eqval} = \sum_{x \in X} P(x)^2$$

$$P_{samecl} = \sum_{c \in C} P(c)^2$$

$$P_{samecl|eqval} = \sum_{x \in X} \left( \frac{P(x)^2}{\sum_{x \in X} P(x)^2} \times \sum_{c \in C} P(c|x)^2 \right)$$

Now we can rewrite Eq. 2.22 to obtain the myopic Relief weight estimation:

$$W'[A] = \frac{P_{eqval} \times GG'(X)}{P_{samecl} (1 - P_{samecl})} \tag{2.23}$$


where $GG'(X)$ is a modified Gini-index gain of the attribute, as seen in
Eq. 2.24

$$GG'(X) = \sum_{x \in X} \left( \frac{P(x)^2}{\sum_{x \in X} P(x)^2} \times \sum_{c \in C} P(c|x)^2 \right) - \sum_{c \in C} P(c)^2 \tag{2.24}$$

As we can see, the difference between this modified version and the original
Gini-index gain described above in Eq. 2.4 is that Gini-index gain uses the
factor

$$\frac{P(x)}{\sum_{x \in X} P(x)} = P(x)$$

while myopic ReliefF uses

$$\frac{P(x)^2}{\sum_{x \in X} P(x)^2}$$

So we can see how this myopic ReliefF in Eq. 2.23 performs a kind of
normalization for multi-valued attributes through the factor $P_{eqval}$. This
solves the bias of impurity functions towards attributes with multiple values.
Another improvement compared with Gini-index gain is that Gini-index gain values
decrease when the number of classes increases. The denominator of Eq. 2.23
avoids this strange behavior.
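
The step from Eq. 2.22 to Eq. 2.23 can be checked numerically. Factoring $P_{eqval}$ out of Eq. 2.22 leaves $(P_{samecl|eqval} - P_{samecl}) / (P_{samecl}(1 - P_{samecl}))$, and $P_{samecl|eqval} - P_{samecl}$ is exactly $GG'(X)$ of Eq. 2.24. The sketch below is a minimal illustration with our own function names; probabilities are relative frequencies and at least two classes are assumed, so that $P_{samecl} < 1$.

```python
from collections import Counter
from typing import Hashable, Sequence, Tuple


def myopic_terms(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> Tuple[float, float, float]:
    """Return (P_eqval, P_samecl, P_samecl|eqval) estimated from counts."""
    m = len(labels)
    n_x = Counter(feature)
    n_c = Counter(labels)
    n_xc = Counter(zip(feature, labels))
    p_eqval = sum((n / m) ** 2 for n in n_x.values())
    p_samecl = sum((n / m) ** 2 for n in n_c.values())
    p_samecl_eqval = sum(
        ((n_x[x] / m) ** 2 / p_eqval) * sum((n_xc[(x, c)] / n_x[x]) ** 2 for c in n_c)
        for x in n_x
    )
    return p_eqval, p_samecl, p_samecl_eqval


def weight_eq_2_22(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    p_eq, p_sc, p_sc_eq = myopic_terms(feature, labels)
    return p_sc_eq * p_eq / p_sc - (1 - p_sc_eq) * p_eq / (1 - p_sc)


def weight_eq_2_23(feature: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    p_eq, p_sc, p_sc_eq = myopic_terms(feature, labels)
    modified_gini_gain = p_sc_eq - p_sc  # GG'(X) of Eq. 2.24
    return p_eq * modified_gini_gain / (p_sc * (1 - p_sc))


feature = [0, 0, 1, 1, 2, 2]
labels = ["a", "a", "a", "b", "b", "b"]
print(weight_eq_2_22(feature, labels), weight_eq_2_23(feature, labels))  # equal up to rounding
```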


3 New contributions

3.1 Redundancy analysis 

To begin with the redundancy analysis of Relief, we first of all have to define
exactly the meaning of redundancy. In general the definitions of redundancy
we find in the literature are based on feature correlation, i.e. two features are
redundant if their values are correlated. One interesting particular case is when
one feature is an exact copy of another, so their values are completely
correlated and one feature is obviously redundant. But in reality a feature may
not be completely correlated with another feature but may be (partially)
correlated with a set of features. In such a case it is not straightforward to
determine redundancy. We can take as an example the features shown in Table 2.

  f_1   f_2   f_r   c
  0     0     1     0
  0     1     1     0
  1     0     1     0
  1     1     0     1

Table 2: Two relevant features and one redundant feature: $c = f_1 \wedge f_2$
and $f_r = \overline{f_1 \wedge f_2}$

The feature $f_r$ is intuitively redundant with the set $\{f_1, f_2\}$ but is
not fully correlated with either of them individually, so it would not be
redundant according to the correlation-based definition of redundancy. So we
have to find a better definition of feature redundancy that enables us to
identify not only pairs of redundant features but also features that are
redundant with any set of other features. Before giving the formal definition of
redundancy let us introduce some previous definitions:

Definition 3.1 Let $U = \{\alpha, \beta, \ldots\}$ be a set of discrete
variables in a problem domain. Each variable is associated with a set of possible
values. A configuration or a tuple $u'$ of $U' \subset U$ is an assignment of
values to every variable in $U'$.

Definition 3.2 A probabilistic domain model (PDM) $P$ over $U$ determines
the probability $P(u')$ of every tuple $u'$ of $U'$ for each $U' \subset U$.

Definition 3.3 For three disjoint subsets $X$, $Y$ and $Z \subset U$, $X$ and
$Y$ are said to be conditionally independent given $Z$ under $P$, noted
$I(X, Z, Y)_P$ or simply $I(X, Z, Y)$ from now on, if (see
[Pearl, 1988, pp. 83-97])

$$I(X, Z, Y) \iff P(x|y, z) = P(x|z) \text{ whenever } P(y, z) > 0 \tag{3.1}$$

Using this notation we can express unconditional independence as
$I(X, \emptyset, Y)$, i.e.,

$$I(X, \emptyset, Y) \iff P(x|y) = P(x) \text{ whenever } P(y) > 0$$

Note that $I(X, Z, Y)$ implies the conditional independence of all pairs of
variables $\alpha \in X$ and $\beta \in Y$, but the converse is not necessarily
true.
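
Definition 3.3 can be checked mechanically on a small probabilistic domain model. The following sketch is only an illustration: the representation of the joint distribution as a dictionary from full configurations to probabilities, the function names and the assumption that the four rows of Table 2 are equally likely are ours. Variables are referred to by their position (0 for $f_1$, 1 for $f_2$, 2 for $f_r$, 3 for $c$).

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def marginal(joint: Dict[Tuple, float], idx: List[int]) -> Dict[Tuple, float]:
    """Marginal distribution over the variables at the positions in idx."""
    out = defaultdict(float)
    for configuration, p in joint.items():
        out[tuple(configuration[i] for i in idx)] += p
    return out


def conditionally_independent(joint: Dict[Tuple, float], xs: List[int], zs: List[int],
                              ys: List[int], tol: float = 1e-12) -> bool:
    """Check I(X, Z, Y): P(x|y,z) = P(x|z) whenever P(y,z) > 0."""
    p_xyz = marginal(joint, xs + ys + zs)
    p_yz = marginal(joint, ys + zs)
    p_xz = marginal(joint, xs + zs)
    p_z = marginal(joint, zs)
    x_values = {key[:len(xs)] for key in p_xz}
    for yz, p in p_yz.items():
        if p <= 0:
            continue
        z = yz[len(ys):]
        for x in x_values:
            lhs = p_xyz.get(x + yz, 0.0) / p
            rhs = p_xz.get(x + z, 0.0) / p_z[z]
            if abs(lhs - rhs) > tol:
                return False
    return True


# Rows of Table 2 as (f_1, f_2, f_r, c), assumed equally likely.
joint = {(0, 0, 1, 0): 0.25, (0, 1, 1, 0): 0.25, (1, 0, 1, 0): 0.25, (1, 1, 0, 1): 0.25}
print(conditionally_independent(joint, xs=[2], zs=[0, 1], ys=[3]))  # True
print(conditionally_independent(joint, xs=[2], zs=[], ys=[0]))      # False
```

The first call shows that, once $f_1$ and $f_2$ are known, the class carries no further information about $f_r$; the second shows that $f_r$ is not unconditionally independent of $f_1$.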

Definition 3.4 A Markov blanket $BL_I(\alpha)$ of an element $\alpha \in U$ is
any subset $S \subset U$ for which (see [Pearl, 1988])

$$I(\alpha, S, U - S - \alpha) \text{ and } \alpha \notin S. \tag{3.2}$$

An intuitive interpretation of Def. 3.3 would be: once $Z$ is given, the
probability of $X$ will not be affected by the discovery of $Y$; or, $Y$ is
irrelevant to $X$ once we know $Z$. Note that the Markov blanket condition in
Def. 3.4 is stronger than conditional independence. It says that, once $S$ is
known, $\alpha$ is irrelevant not only to the class but also to the rest of the
features, so $S$ has all the information that $\alpha$ has about $C$ and all the
information $\alpha$ has about $U - S - \alpha$. This takes us to our definition
of redundancy:

Definition 3.5 Given a set of features $F$ and a class feature $C$, a redundant
feature $\alpha \in F$ is a feature for which there exists a Markov blanket
$S = BL_I(\alpha)$ within $\{F, C\}$ such that $S \subset F$.


An interesting property of Markov blankets is the following: if we have removed
a feature $\alpha$ for which there existed $BL_I(\alpha) \subset U$, and we now
eliminate another feature $\beta$ for which there exists
$BL_I(\beta) \subset U - \alpha$, then we can prove that there also exists
$BL_I(\alpha) \subset U - \beta$; the proof can be found in
[Koller and Sahami, 1996]. That is, a redundant feature remains redundant when
other redundant features are removed. So if we proceed to remove features using
this criterion, we will never have to reconsider our decisions.

Unfortunately, we rarely find a fully redundant feature, but rather one
whose information is nearly subsumed by other features. So we would like
to know not only whether a feature is redundant or not but also its degree of
redundancy. We would like a function $R'$ which, given a feature
$\alpha \in U$ and a set of features $S \subset U$, gives us the degree of
redundancy of this feature with respect to the set. Ideally we would like a
function $R' : U \times \mathcal{P}(U) \to [0,1]$ that satisfies the following
propositions:

$$R'(\alpha, BL_I(\alpha)) = 1$$

$$R'(\alpha, U - \alpha_i) \leq R'(\alpha, U), \quad \forall \alpha_i \in U$$

To achieve this we should change the boolean definition of conditional
independence to some function of $P(x|y, z)$ and $P(x|z)$.


Definition 3.6 Let U be our set of features, $\alpha$ the feature we are
evaluating, and S some subset of U not containing $\alpha$. Let u be a
configuration of U. We write $s_u$, $\bar{s}_u$ and $\alpha_u$ for the
configuration of S, the configuration of $U - S - \alpha$ and the value of
$\alpha$, respectively, when the configuration of U is u. We define
$\mathcal{U}$ as the set of all possible configurations of U for which
$P(s_u, \bar{s}_u) > 0$.

With all that, we define the redundancy level R' as:

$R'(\alpha, U) = 1 - \max_{S \subset U - \alpha} \sum_{u \in \mathcal{U}} \left| P(\alpha_u \mid s_u) - P(\alpha_u \mid s_u, \bar{s}_u) \right|$ (3.6)

Note that the calculation of this redundancy level is exponential in the
number of features in our set, as it compares the conditional probabilities for
all possible subsets of U, so the max function will have to compare
$|\mathcal{P}(U)| = 2^{|U|}$ terms. For each subset we also have an exponential
cost in the number of values of the features, because the sum is over each
configuration u of U.

Clearly, although Eq. 3.6 gives an intuitively consistent definition of
redundancy level, its computational cost might be too large for R' to be
directly applied in a feature weighting (or feature selection) algorithm. We
should use an estimate of R' that offers a good tradeoff between accuracy and
complexity. In fact, the aim of the definition of R' was not to obtain an
efficient algorithm to calculate the redundancy level of a feature. The
definition had three basic (related) objectives: first, to provide a suitable
formal definition of redundancy in order to study the effect of feature
redundancy on the different existing algorithms, for instance ReliefF; second,
to serve as a starting point for new extensions to methods whose performance
decreases in the presence of redundant features, Relief again being an example;
and finally, to direct the development of new algorithms that effectively and
efficiently estimate redundancy.

3.2 Double Relief 

When more and more irrelevant features are added to a dataset, the distance
calculation of Relief degrades, as instances may be considered neighbors when
in fact they are far from each other if we compute their distance using only
the relevant features. In such cases the algorithm may lose its context of
locality and in the end it may fail to recognize relevant features.

The $\mathrm{diff}(A_i, I_1, I_2)$ function calculates the difference between
the values of the feature $A_i$ for two instances $I_1$ and $I_2$. The sum of
differences over all features is used to determine the distance between two
instances in the nearest hit and miss calculation (see equation 2.16).

As seen for the k-nearest neighbors classification algorithm (kNN), there are
many weighting schemes which assign different weights to the features in the
calculation of the distance between instances (see equation 3.3).

$\delta'(I_1, I_2) = \sum_{i=1}^{a} w(A_i) \, \mathrm{diff}(A_i, I_1, I_2)$ (3.3)
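
The weighted distance of equation 3.3 can be sketched in a few lines. The
sketch assumes the usual ReliefF convention for diff (0 or 1 for nominal
features, range-normalised absolute difference for numeric ones). The function
names, the `ranges` argument and the example instances are hypothetical and
only serve as illustration.

```python
def diff(a, x, y, numeric_range=None):
    """Difference of feature a between instances x and y, in [0, 1].

    Nominal features: 0 if the values agree, 1 otherwise.
    Numeric features: |x[a] - y[a]| scaled by the feature's range (max - min).
    """
    if numeric_range is None:
        return 0.0 if x[a] == y[a] else 1.0
    return abs(x[a] - y[a]) / numeric_range


def weighted_distance(x, y, weights, ranges=None):
    """Eq. 3.3: sum over all features of w(A_i) * diff(A_i, x, y)."""
    ranges = ranges or [None] * len(weights)
    return sum(w * diff(a, x, y, ranges[a]) for a, w in enumerate(weights))


x, y = ("red", 2.0, "yes"), ("blue", 6.0, "yes")
ranges = [None, 8.0, None]                 # only the second feature is numeric
plain = weighted_distance(x, y, [1.0, 1.0, 1.0], ranges)
weighted = weighted_distance(x, y, [0.1, 0.8, 0.1], ranges)
```

With unit weights the expression reduces to the unweighted sum of differences
used by the original algorithm.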

In the same way that Relief's estimates of features' quality have been used
successfully as weights for the distance calculation of kNN
[Wettschereck et al., 1997], we could use the estimates from the previous
iteration to compute the distance between instances while searching for the
nearest hits and misses. We will refer to this version of ReliefF as double
ReliefF or, in short, dReliefF. The problem with using the weight estimates is
that in early iterations they could be too biased towards the first instances
and could be far from the optimal weights. So, for small t, the optimal weight
of $A_i$ may be very different from $W[A_i]_t$.
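
The idea can be illustrated with a deliberately simplified Relief loop (one
nearest hit and one nearest miss per iteration, nominal features only), which
is not the full ReliefF. Two choices in this sketch are assumptions that the
text does not specify: negative estimates are clipped to zero before being used
as distance weights, and the first iteration uses unit weights. The estimates
are not rescaled by t, because a common scale factor does not change which
neighbor is nearest.

```python
import random
from itertools import product


def diff(a, x, y):
    """Nominal difference of feature a: 0 if the values agree, 1 otherwise."""
    return 0.0 if x[a] == y[a] else 1.0


def relief(data, labels, m, double=False, seed=0):
    """Simplified Relief (one nearest hit, one nearest miss per iteration).

    With double=True the neighbour search weights each feature by the estimate
    accumulated in the previous iterations, as described for dReliefF.
    """
    rng = random.Random(seed)
    n_feat = len(data[0])
    W = [0.0] * n_feat
    for t in range(1, m + 1):
        if double and t > 1:
            # ASSUMPTION: negative estimates are clipped to 0 so distances stay
            # non-negative; if nothing is positive yet, fall back to unit weights.
            dw = [max(w, 0.0) for w in W]
            if sum(dw) == 0.0:
                dw = [1.0] * n_feat
        else:
            dw = [1.0] * n_feat

        def dist(x, y):
            return sum(dw[a] * diff(a, x, y) for a in range(n_feat))

        r = rng.randrange(len(data))
        hits = [i for i in range(len(data)) if i != r and labels[i] == labels[r]]
        misses = [i for i in range(len(data)) if labels[i] != labels[r]]
        if not hits or not misses:
            continue
        hit = min(hits, key=lambda i: dist(data[r], data[i]))
        miss = min(misses, key=lambda i: dist(data[r], data[i]))
        for a in range(n_feat):
            W[a] += (diff(a, data[r], data[miss]) - diff(a, data[r], data[hit])) / m
    return W


# Toy problem: the class equals feature 0; features 1 and 2 carry no information.
data = list(product((0, 1), repeat=3))
labels = [x[0] for x in data]
w_plain = relief(data, labels, m=20)
w_double = relief(data, labels, m=20, double=True)
```

On this toy problem both variants give the informative feature the largest
weight; the interesting differences only appear when many irrelevant features
distort the neighbor search.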

What we want is to begin the distance calculation without using the weight
estimates and then, as Relief's weight estimates become more accurate (because
more instances have been taken into account), increase the importance of these
weights in the distance calculation. Consider a distance calculation like the
one in equation 3.4:

$\delta(I_1, I_2) = \sum_{i=1}^{a} f(W[A_i]_t, t) \, \mathrm{diff}(A_i, I_1, I_2)$ (3.4)

We would like a function $f : \mathbb{R} \times (0, \infty) \to \mathbb{R}$
such that:

• $f(w, t)$ is increasing with respect to t

• $f$ is continuous

• $f(w, 0) = 1$

• $f(w, \infty) = w$

One such function could be the one in equation 3.5. We will refer to the
version of ReliefF using this distance equation as progressively weighted
double Relief or, in short, pdReliefF.

f(w, t) = (3.5)

where T is a control parameter that determines the steepness of the curve
described by f (see figure 4).

Figure 4: Plot of function f for 10 instances with w = 0.5

Another desirable property for our function would be that it always gives the
same result regardless of the number of iterations. In other words, if m is the
total number of iterations, we would like f(w, m) to be the same value whatever
the value of m. To achieve that we must vary the value of T according to the
total number of iterations so as to decrease the steepness of the function as
the total number of iterations increases. The value of T for which f(w, m)
stays the same is T = 2/log(m). In figure 5 we can see how f varies the
influence of different weights (even an unrealistic one that is greater than 1)
as iterations go on. We can see that with this value for T the function
converges in the first few iterations and then stabilizes near w. For problems
with many iterations a softer function may be tried if values converge
prematurely.

Figure 5: Plot of function f for 10 instances with T = 2


4 Empirical results 

Before presenting the empirical results we have to define a success criterion
for the weight estimates. For problems where we know which of the features are
important (e.g., artificial datasets) we can use this knowledge to evaluate the
estimates. In addition to the separability and usability measures of
[Kononenko, 1994], two more indicators, minimality and completeness, may be
useful in case of negative separability, as they help in determining the
quality of the given solution. More precise definitions follow, where I denotes
the set of important features and $W_F$ the weight estimated for feature F.

separability Shows the ability of the weight estimates to distinguish between
    important and unimportant features. Positive separability (s > 0) means
    that important features are correctly separated from unimportant ones.

    $s = \min_{F \in I} W_F - \max_{F \notin I} W_F \in [-2, 2]$

usability Shows the ability of the weight estimates to distinguish one of the
    important features from the unimportant ones. Positive usability (u > 0)
    means that at least one of the important features is correctly separated
    from the unimportant ones.

    $u = \max_{F \in I} W_F - \max_{F \notin I} W_F \in [-2, 2]$

minimality Shows the ratio of important features in the minimum set of
    features that contains all the important features if we select features in
    decreasing weight order. Note that s > 0 ⇒ m = 1.

    $m = |I| / |M| \in (0, 1]$ where $M = \{F \mid W_F \ge \min_{G \in I} W_G\}$

completeness Shows the ratio of important features that we would take if,
    selecting features in decreasing weight order, we stopped before selecting
    the first unimportant feature. Note that again s > 0 ⇒ c = 1.

    $c = |C| / |I| \in (0, 1]$ where $C = \{F \mid W_F > \max_{G \notin I} W_G\}$
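
The four criteria can be computed directly from a weight vector once the
important features are known. The sketch below follows the verbal definitions
above (worst important weight versus best unimportant weight, and so on). The
function name `success_criteria` and the dictionary layout are illustrative
assumptions. The example uses the dReliefF weights reported for CorrAL in
Table 3, taking the four features that determine the class as the important
ones.

```python
def success_criteria(weights, important):
    """Separability, usability, minimality and completeness of weight estimates.

    weights   : dict mapping feature name -> estimated weight
    important : set of feature names known to be important
    """
    imp = [w for f, w in weights.items() if f in important]
    unimp = [w for f, w in weights.items() if f not in important]
    worst_imp, best_imp, best_unimp = min(imp), max(imp), max(unimp)
    s = worst_imp - best_unimp      # are all important features above the rest?
    u = best_imp - best_unimp       # is at least one important feature above?
    # Smallest top-ranked set that contains every important feature.
    M = [f for f, w in weights.items() if w >= worst_imp]
    # Features selected before the first unimportant one is reached.
    C = [f for f, w in weights.items() if w > best_unimp]
    return {"s": s, "u": u,
            "m": len(important) / len(M),
            "c": len(C) / len(important)}


# dReliefF weights reported for CorrAL in Table 3.
dreliefF = {"B0": 0.272, "B1": 0.273, "A0": 0.277, "A1": 0.277,
            "C": 0.042, "I": -0.222}
scores = success_criteria(dreliefF, {"A0", "A1", "B0", "B1"})
```

For these weights the computed separability agrees with the value 0.230 listed
for dReliefF in Table 3.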

The first set of artificial problems is the so-called Modulo-p-I. In these
datasets we find I important features and R random ones, all of them integers
in the range [0, p). The class value C is also an integer in the same range and
can be calculated for an instance X having values $X_1, X_2, \ldots, X_I$ in
its important features as shown in Eq. 4.1. We will test our criteria for
various parameters of ReliefF on two different problems (Modulo-2-2 and
Modulo-4-3), incrementally adding random features.

$C(X) = \left( \sum_{i=1}^{I} X_i \right) \bmod p$ (4.1)
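
A generator for this family of problems can be sketched as follows. Uniform
sampling of the feature values, the sample size and the placement of the
important features in the first positions are assumptions of the sketch, and
the function name `modulo_p_i` is hypothetical.

```python
import random


def modulo_p_i(n, p, n_important, n_random, seed=0):
    """Generate n instances of a Modulo-p-I problem with extra random features.

    Every feature is an integer in [0, p); the class is the sum of the
    important features modulo p (Eq. 4.1). Important features come first.
    """
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        x = [rng.randrange(p) for _ in range(n_important + n_random)]
        data.append(x)
        labels.append(sum(x[:n_important]) % p)
    return data, labels


# Modulo-4-3 with 10 additional random features.
data, labels = modulo_p_i(n=200, p=4, n_important=3, n_random=10)
```

Increasing `n_random` while keeping the other parameters fixed reproduces the
kind of incremental setting used in the experiments.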

In Fig. 6 and 7 we can see the different behaviors of the three algorithms
when more and more random features are added. While ReliefF seems to gradually
degrade its performance, dReliefF is more erratic and pdReliefF obtains the
best results. This supports our theory that although it seems a good idea to
use ReliefF's own estimates as weights for its distance function, a bad start
can make dReliefF's estimates even poorer than ReliefF's.

Figure 6: Separability for Modulo-2-2 problem with random features (x-axis:
number of random attributes)

Figure 7: Separability for Modulo-4-3 problem with random features

Another good test is the CorrAL dataset introduced in [Kohavi and John, 1997].
This dataset is composed of 6 features ($A_0$, $A_1$, $B_0$, $B_1$, C, I). Four
of them ($A_0$, $A_1$, $B_0$, $B_1$) fully determine the class of the instance
when used together; the class can be expressed as
$(A_0 \wedge A_1) \vee (B_0 \wedge B_1)$. C is 75% correlated with the class,
and the last one, I, is completely random. Results are shown in Table 3.

Feature        ReliefF   dReliefF   pdReliefF
B0              0.259     0.272      0.272
B1              0.197     0.273      0.273
A0              0.194     0.277      0.278
A1              0.128     0.277      0.278
C               0.281     0.042      0.044
I              -0.141    -0.222     -0.222
separability   -0.153     0.230      0.228
usability       0.422     0.047      0.050

Table 3: Weights and separability for CorrAL dataset (with 5 nearest
neighbors).

So for the CorrAL dataset the three algorithms correctly identify the
irrelevant feature and rank it last, but the normal version of ReliefF gives a
larger weight to the correlated feature than it should. The double versions of
the algorithm, on the other hand, correctly identify the four features that
completely determine the class and give them larger weights, followed by the
correlated one and leaving the random one last. We can see that the behavior of
the two double versions is very similar: the progressively weighted estimation
is a little more usable but a little less separable.

The next dataset (led24) is one of the LED display domain datasets from
[S. Hettich and Merz, 1998]. In fact it is an extension of the led7 dataset.
The led7 dataset consists of 7 boolean-valued features ($I_1, \ldots, I_7$),
each of them representing one of the light-emitting diodes contained in a LED
display. They indicate whether the corresponding segment is on or off (see
Fig. 8). The class feature has range [0,9] and coincides with the digit
represented by the display. This dataset has an added difficulty as it has 10%
noise in its features, i.e., each feature of each instance has a 10% chance of
having its value negated. This is quite a difficult problem for classifiers and
the version with 17 unimportant features is especially difficult, e.g. a
nearest neighbor classification algorithm falls from 71% classification success
with the 7-feature version to a poor 41% with the other one. So it would be
desirable for ReliefF to separate the important features from the rest. Table 4
shows separability and usability for this dataset. There it can be seen that
the behavior of the three algorithms is very similar for this domain. All of
them are able to separate the seven important features from the rest, and even
the values of s and u are close for the three algorithms.

Figure 8: A LED display indicating the meaning of the features

Algorithm     s       u
ReliefF       0.131   0.340
dReliefF      0.084   0.234
pdReliefF     0.104   0.278

Table 4: Separability and usability for led24 dataset.

Finally, the last artificial datasets to be tested are the Monks datasets.
They are interesting because, even though they do not consist of many features,
they are well-known datasets, have interesting feature interactions and let us
compare the feature ordering produced by each algorithm with the intended
ordering. There are three Monks datasets but we will only use Monk-1 and Monk-3
because Monk-2 does not contain unimportant features. They consist of six
numerical features $A_1, \ldots, A_6$ with ranges varying from [1,2] to [1,4]
and a boolean class value. For Monk-1 the class $C_{M1}$ can be calculated as
$C_{M1} = (A_1 = A_2) \vee (A_5 = 1)$ and the class $C_{M3}$ for Monk-3 as
$C_{M3} = (A_5 = 3 \wedge A_4 = 1) \vee (A_5 \neq 4 \wedge A_2 \neq 3)$.
So for the first problem, $A_3$, $A_4$ and $A_6$ are unimportant and, among the
other three, $A_1$ and $A_2$ would help us determine the class value better
than $A_5$, as only one of the four possible values of $A_5$ is important. For
Monk-3 the important features are only $A_5$, $A_4$ and $A_2$, and the rest do
not influence the instance's class. Among these three features, $A_5$ and $A_2$
should be preferred over $A_4$, as using only the second term of the
disjunction we can achieve a 97% performance. It is important to say that
Monk-3 has 5% additional noise (misclassifications).
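
The 97% figure can be checked by enumerating the noise-free instance space.
The sketch below assumes the standard value ranges of the Monk's problems (3,
3, 2, 3, 4 and 2 values for $A_1, \ldots, A_6$) and a uniform distribution over
the 432 possible instances. The function names are illustrative.

```python
from itertools import product

# Value ranges of A1..A6 in the Monk's problems (3, 3, 2, 3, 4 and 2 values).
RANGES = [range(1, 4), range(1, 4), range(1, 3),
          range(1, 4), range(1, 5), range(1, 3)]


def monk1(a):
    """Monk-1 concept: (A1 = A2) or (A5 = 1); a[0] holds A1."""
    return a[0] == a[1] or a[4] == 1


def monk3(a):
    """Monk-3 concept without noise: (A5 = 3 and A4 = 1) or (A5 != 4 and A2 != 3)."""
    return (a[4] == 3 and a[3] == 1) or (a[4] != 4 and a[1] != 3)


def monk3_second_term(a):
    """Only the second term of the Monk-3 disjunction."""
    return a[4] != 4 and a[1] != 3


instances = list(product(*RANGES))
agree = sum(monk3(a) == monk3_second_term(a) for a in instances)
agreement = agree / len(instances)
```

The two concepts differ only when $A_5 = 3$, $A_4 = 1$ and $A_2 = 3$, which
covers 12 of the 432 instances, so the second term alone agrees with the full
concept on about 97.2% of the instance space.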

Table 5 shows the results for the three variants of ReliefF when applied to
the Monk-1 dataset. We observe that although the three algorithms correctly
separate the important features from the unimportant ones, only ReliefF gives
the expected ordering for the important features. The same results for the
Monk-3 dataset are shown in Table 6. For this dataset we can see that
separability, even though positive, is very small for the three algorithms. In
addition, all of them rank $A_4$ as the lowest of the important features, which
agrees with what we expected. Here the double versions of the algorithm seem to
help in discriminating the important from the unimportant features: in the two
cases they improve separability, although the ordering of the important
features is worse. The double versions seem to increase the weight difference
between important and unimportant features but decrease the weight difference
between features in the same group.

Algorithm    s      u      Feature ordering
ReliefF      0.26   0.38   A1, A2, A5, A3, A6, A4
dReliefF     0.42   0.44   A5, A1, A2, A3, A6, A4
pdReliefF    0.41   0.43   A1, A5, A2, A3, A6, A4

Table 5: Separability, usability and feature ordering for Monk-1 dataset.

Algorithm    s      u      Feature ordering
ReliefF      0.05   0.43   A5, A2, A4, A3, A1, A6
dReliefF     0.08   0.29   A2, A5, A4, A3, A1, A6
pdReliefF    0.05   0.31   A2, A5, A4, A3, A1, A6

Table 6: Separability, usability and feature ordering for Monk-3 dataset.

The second group of experiments uses some well-known datasets from UCI
[S. Hettich and Merz, 1998]. These are datasets of real data, so we do not know
which of the features are important and which are not. For this reason we will
not be able to compute the above criteria for these datasets. So, to evaluate
the quality of the algorithms' estimates, we will use the performance obtained
with a classifier. We will first try a classification with the feature with the
greatest weight, then with the two most heavily weighted variables, and so on
until all variables are used. When all tests are completed we will compare the
performance of the classifier when using all features with the performance when
using the best subset found using Relief's estimates. We will use the 1-NN
classifier because of its simplicity and its sensitivity to a bad choice of
features.
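
The evaluation procedure just described can be sketched as follows. For
brevity the sketch scores each subset with leave-one-out 1-NN and a Hamming
distance over nominal features; both are assumptions of the example rather than
a description of the exact protocol used in the experiments, and the function
names are hypothetical.

```python
from itertools import product


def one_nn_accuracy(data, labels, feats):
    """Leave-one-out accuracy of 1-NN using Hamming distance on features feats."""
    correct = 0
    for i, x in enumerate(data):
        nearest = min((j for j in range(len(data)) if j != i),
                      key=lambda j: sum(x[a] != data[j][a] for a in feats))
        correct += labels[nearest] == labels[i]
    return correct / len(data)


def best_ranked_subset(data, labels, weights):
    """Evaluate the top-1, top-2, ... features by weight; return the best prefix."""
    order = sorted(range(len(weights)), key=lambda a: weights[a], reverse=True)
    scores = [one_nn_accuracy(data, labels, order[:k])
              for k in range(1, len(order) + 1)]
    k_best = scores.index(max(scores)) + 1      # smallest prefix reaching the maximum
    return order[:k_best], scores


# Toy data: every configuration of three binary features appears twice and the
# class equals feature 0, which also receives the largest (made-up) weight.
data = list(product((0, 1), repeat=3)) * 2
labels = [x[0] for x in data]
subset, scores = best_ranked_subset(data, labels, [0.9, -0.1, -0.2])
```

Ties between prefixes are resolved in favor of the smaller subset, which
matches the goal of reducing the number of features.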

The first chosen dataset is the E. coli promoter gene sequences. This dataset
contains a set of 57 nominal variables representing a DNA sequence of
nucleotides. A promoter is a DNA sequence that enables a gene to be
transcribed. The promoter is recognized by RNA polymerase, which then initiates
transcription. For the RNA polymerase to make contact, the DNA sequence must
have a valid conformation so that the two pieces of the contact region
spatially align. But the shape of the DNA molecule is a very complex function
of the nucleotide sequence due to the complex interactions between nucleotides,
so strong interactions among features are expected. In Fig. 9 we can see the
results of applying feature selection in the way described above for the 1-NN
classifier. As can be seen, the results for the classification task are in
general not very good, but for all three versions of the algorithm the maximum
performance is achieved when the number of features used is 2, far fewer than
the initial 57 features. The three versions of the algorithm agree on the first
two features to be added (15 and 16), although the ordering in the case of
normal Relief is inverted: it selects 16 first and then 15.

Figure 9: Classification success % for the promoter gene problem (x-axis:
number of features)

Another problem that can help us determine whether the weighted distance
calculation makes sense is the lung cancer dataset, also from UCI. It consists
of data from 32 patients suffering from three different types of pathological
lung cancer. The objective is to distinguish among the three types of cancer
given a set of 56 nominal features with ranges [0,3]. The authors of the
dataset gave no information on the meaning of individual features. The data
probably come from different types of tests performed on patients, and as there
are many features one can venture the hypothesis that many of them are standard
tests that cannot help in determining the patient's type of disease. So these
unimportant features may affect the way that ReliefF chooses the nearest
neighbors. Moreover, it is especially important to reduce the number of
features in this problem because the number of instances is very low compared
to it, so classifiers may be fooled by unimportant features. We can see in
Fig. 10 that in this case the performance of the 1-NN classifier is
significantly improved when we apply feature selection. While a classification
using all of the features gives a correct classification percentage of 43.75,
the best results obtained with a subset of the features are above 90% with all
of the versions of ReliefF. Although the same performance is achieved with the
best subset given by dReliefF and the one given by pdReliefF, the results of
the latter are better as the same performance is achieved with 4 fewer
features.

Figure 10: Classification success % for the lung cancer problem

Algorithm    Max. 1-NN performance   # of features
ReliefF      93.75                   28
dReliefF     96.88                   37
pdReliefF    96.88                   33

Table 7: Best results obtained with 1-NN classifier for the lung cancer problem.

5 Conclusions and future work


In our experiments we have seen how the double versions of the algorithm
helped in the correct feature weighting of some problems, while in other cases
performance is not improved or is even diminished. An interesting property of
these new versions of the algorithm is that they seem to help in problems where
many irrelevant features exist, which was the initial objective. The
performance of the algorithm improved in the Modulo-p-I problems as more and
more random features were added. We saw in the experiments that although
ReliefF's performance with few attributes was better, as the number of random
features increased it began to decrease, and for a relatively small number of
random features dReliefF and pdReliefF outperformed the original algorithm.
Furthermore, the performance of the latter methods did not vary with the
addition of random features. In contrast, the results obtained with the LED
dataset were not that encouraging. Although the dataset had more than twice as
many random features as relevant ones, the results for the three algorithms
were very similar. This might be because the separability for this problem was
so low (though positive) due to the difficulty of the problem (even without
random features) caused by the presence of noise. The trials with datasets
having fewer irrelevant features, i.e. the Monks problems, gave very similar
results for all the versions. This has a logical explanation: the behavior of
the double version when all the attributes are relevant is not very different
from the original one.

The experiments with real data from the UCI Machine Learning Repository
[Hunt et al., 1966], which consisted in running a 1-NN classifier using
successive subsets of features proposed by the three versions of ReliefF,
showed interesting results. The success evaluation criterion was the percentage
of instances classified correctly using 5-fold cross-validation. Two datasets
were chosen for these experiments because of their large number of features and
the intuition that they might contain a large number of irrelevant features. In
both of them the double versions of the algorithm chose a subset of features
that helped the 1-NN classifier best in classifying the instances. In the case
of the DNA promoters dataset the performance increase was not significant, but
this may be due to the fact that 1-NN does not seem to be capable of solving
this problem, as it gave poor results in all cases. On the other hand, for the
lung cancer dataset, we obtained significantly better classification
performance with the subset from the double versions and, in addition, the
subset found by pdReliefF had fewer variables than the one found by dReliefF.

We observed almost no difference between the two double versions. This may be
because of the progressive weighting function used. The function attenuates the
influence of the weight estimates in the first iterations but rapidly increases
their influence, and after the first few iterations the algorithm behaves
exactly as dReliefF. So, as future work, some other softer functions may be
tested.

Another clear line of future work is the formal study of the influence that
redundant features have on ReliefF, dReliefF and pdReliefF. Robnik-Šikonja and
Kononenko started this study in [Robnik-Šikonja and Kononenko, 2003], where
they proved that the addition of successive copies of one feature divided the
initial weight ReliefF assigned to the feature among all the copies, each of
which received the same weight. But some crucial questions still have to be
answered: Do equal weights for two features mean that the features are
redundant to each other? Does an equal sequence of weight updates for two
features mean that they are redundant to each other? How can Relief be extended
to diminish or eliminate the negative effect of redundant features? Does
ReliefF compute some kind of approximation to R'?